
Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU



Abstract

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested parallelism at runtime. However, the effective use of DP must still be understood: a naive use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naive use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.
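The general idea behind workload consolidation can be illustrated with a small CPU-side model. The sketch below is not the paper's compiler transformation and does not reproduce any of its three schemes, which the abstract does not detail. It only assumes that each child-kernel launch carries a fixed overhead, and it counts launches for an irregular loop in two cases: every parent thread launches its own child kernel, or groups of parent threads buffer their work so that one launch serves the whole group. The names `naive_dp`, `consolidated_dp`, and `group_size` are hypothetical.

```python
from typing import List, Tuple


def naive_dp(work: List[List[int]]) -> Tuple[List[int], int]:
    """Toy model of naive DP: each parent thread with work launches a child."""
    results = []
    launches = 0
    for items in work:  # one iteration per parent "thread"
        if items:
            launches += 1  # one child-kernel launch per parent with work
        # Body of the (modelled) child kernel: process this parent's items.
        results.append(sum(x * x for x in items))
    return results, launches


def consolidated_dp(work: List[List[int]], group_size: int) -> Tuple[List[int], int]:
    """Toy model of consolidation: one child launch per group of parents."""
    results = [0] * len(work)
    launches = 0
    for start in range(0, len(work), group_size):
        end = min(start + group_size, len(work))
        # Parents in the group append (owner, item) pairs to a shared buffer.
        buffer = [(owner, x) for owner in range(start, end) for x in work[owner]]
        if not buffer:
            continue  # no work in this group, so no launch
        launches += 1  # a single child launch processes the whole buffer
        for owner, x in buffer:
            results[owner] += x * x
    return results, launches


if __name__ == "__main__":
    # Irregular loop: the amount of nested work differs per parent thread.
    work = [[1, 2, 3], [], [4], [5, 6], [7, 8, 9, 10], [], [11], [12, 13]]
    naive_results, naive_launches = naive_dp(work)
    cons_results, cons_launches = consolidated_dp(work, group_size=4)
    print(naive_results == cons_results, naive_launches, cons_launches)
```

In this model both variants compute the same per-parent results, while the consolidated variant issues fewer, larger launches. On a real GPU the buffer would live in shared or global memory, and the group would correspond to some hardware granularity; that mapping is outside what this sketch shows.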
